cp: Fix: perf script ddp nccl-ub (2158) into r0.3.0 #2217

Merged
ko3n1g merged 2 commits into r0.3.0 from cherry-pick-2158-r0.3.0 on Feb 6, 2026
Conversation

ko3n1g commented Feb 4, 2026

beep boop [🤖]: Hi @youngeunkwon0405 👋,

we've cherry picked #2158 into  for you! 🚀

Please review and approve this cherry pick at your convenience!

Summary by CodeRabbit

  • Chores
    • Enhanced NCCL runtime configuration handling with additional policy settings.
    • Refactored performance configuration system to improve code organization and maintainability.

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
ko3n1g commented Feb 4, 2026

/ok to test 1c987a0

copy-pr-bot bot commented Feb 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Feb 4, 2026

Walkthrough

Configuration changes to NCCL runtime settings: adds NCCL_CTA_POLICY to the custom environment variables when NCCL UB is enabled, and refactors NCCL UB handling by moving it out of the Megatron FSDP overrides into a dedicated helper function while maintaining functional equivalence.
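
As a rough illustration of the environment-variable change described above, the sketch below adds NCCL_CTA_POLICY to a dictionary of custom environment variables only when NCCL UB is requested. The function name `build_custom_env_vars`, its parameters, and the default value `"1"` are assumptions made for this example; the actual code and the value used in `scripts/performance/setup_experiment.py` are not shown in this conversation.

```python
from typing import Dict, Optional


def build_custom_env_vars(
    nccl_ub: bool,
    base_env: Optional[Dict[str, str]] = None,
    cta_policy: str = "1",  # assumed placeholder value, not taken from the PR
) -> Dict[str, str]:
    """Return a copy of base_env, adding NCCL_CTA_POLICY only when nccl_ub is set.

    Hypothetical helper for illustration; it does not mirror the real script.
    """
    env = dict(base_env or {})
    if nccl_ub:
        # The PR summary says NCCL_CTA_POLICY is added when NCCL UB is enabled.
        env["NCCL_CTA_POLICY"] = cta_policy
    return env


if __name__ == "__main__":
    print(build_custom_env_vars(nccl_ub=True, base_env={"FOO": "bar"}))
    print(build_custom_env_vars(nccl_ub=False, base_env={"FOO": "bar"}))
```

Copying `base_env` rather than mutating it keeps the caller's dictionary unchanged, which makes the conditional behavior easy to check in isolation.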

Changes

| Cohort / File(s) | Summary |
|---|---|
| NCCL Configuration Settings: `scripts/performance/setup_experiment.py` | Added NCCL_CTA_POLICY policy setting to custom environment variables when NCCL UB is enabled, expanding NCCL runtime configuration options. |
| NCCL UB Override Refactoring: `scripts/performance/utils/overrides.py` | Decoupled NCCL UB handling from Megatron FSDP by removing nccl_ub parameter from _set_megatron_fsdp_overrides, introducing new _set_nccl_ub_overrides helper function, and updating callers to apply configurations independently. |
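
The decoupling summarized in the table can be sketched as follows. Only the two helper names, `_set_megatron_fsdp_overrides` and `_set_nccl_ub_overrides`, come from the summary; the `Recipe` and `DDPConfig` classes, their fields, the helper signatures, and the `apply_overrides` caller are hypothetical stand-ins, since the real signatures in `scripts/performance/utils/overrides.py` are not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class DDPConfig:
    # Hypothetical fields standing in for the real recipe configuration.
    use_megatron_fsdp: bool = False
    nccl_ub: bool = False


@dataclass
class Recipe:
    ddp: DDPConfig = field(default_factory=DDPConfig)


def _set_megatron_fsdp_overrides(recipe: Recipe, use_megatron_fsdp: bool = False) -> Recipe:
    """Apply only the FSDP-related override; no nccl_ub parameter any more."""
    if use_megatron_fsdp:
        recipe.ddp.use_megatron_fsdp = True
    return recipe


def _set_nccl_ub_overrides(recipe: Recipe, nccl_ub: bool = False) -> Recipe:
    """Apply the NCCL UB override independently of FSDP."""
    if nccl_ub:
        recipe.ddp.nccl_ub = True
    return recipe


def apply_overrides(recipe: Recipe, use_megatron_fsdp: bool = False, nccl_ub: bool = False) -> Recipe:
    """Hypothetical caller: each override is applied on its own."""
    recipe = _set_megatron_fsdp_overrides(recipe, use_megatron_fsdp=use_megatron_fsdp)
    recipe = _set_nccl_ub_overrides(recipe, nccl_ub=nccl_ub)
    return recipe


if __name__ == "__main__":
    # NCCL UB requested without Megatron FSDP (for example, plain DDP).
    print(apply_overrides(Recipe(), use_megatron_fsdp=False, nccl_ub=True))
```

With the two helpers separated, a run that requests NCCL UB without Megatron FSDP still gets the NCCL UB setting, which appears to be the case the PR title ("ddp nccl-ub") refers to.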

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

r0.3.0, bug

Suggested reviewers

  • erhoo82
  • malay-nagda
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR description lacks documentation of test results, performance metrics, or regression verification for NCCL configuration changes. | Add test results and regression verification to PR description demonstrating the fix resolves NCCL UB without introducing regressions. |
| Title check | ❓ Inconclusive | The title references the fix ID (2158) and target branch but is unclear about the actual technical change being fixed. | Clarify what the fix addresses, e.g., 'Fix NCCL UB configuration in performance script DDP' or 'Fix NCCL_CTA_POLICY setting in performance profiling scripts'. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405

Hi @ko3n1g, should we re-trigger Run CICD since it got a fix and check if it works fine now?

@ko3n1g ko3n1g merged commit 241572b into r0.3.0 Feb 6, 2026
1 check passed
@ko3n1g ko3n1g deleted the cherry-pick-2158-r0.3.0 branch February 6, 2026 18:15
@coderabbitai coderabbitai bot mentioned this pull request Feb 11, 2026